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Functional MRI (fMRI) has become the most common method 
for investigating the human brain. However, fMRI data present some 
complications for statistical analysis and modeling. One recently de- 
veloped approach to these data focuses on estimation of computa- 
tional encoding models that describe how stimuli are transformed 
into brain activity measured in individual voxels. Here we aim at 
building encoding models for fMRI signals recorded in the primary 
visual cortex of the human brain. We use residual analyses to reveal 
systematic nonlinearity across voxels not taken into account by pre- 
vious models. We then show how a sparse nonparametric method [J. 
Roy. Statist. Soc. Ser. B 71 (2009b) 1009-1030] can be used together 
with correlation screening to estimate nonlinear encoding models ef- 
fectively. Our approach produces encoding models that predict about 
25% more accurately than models estimated using other methods 
[Nature 452 (2008a) 352-355]. The estimated nonlinearity impacts 
the inferred properties of individual voxels, and it has a plausible 
biological interpretation. One benefit of quantitative encoding mod- 
els is that estimated models can be used to decode brain activity, in 
order to identify which specific image was seen by an observer. En- 
coding models estimated by our approach also improve such image 
identification by about 12% when the correct image is one of 11,500 
possible images. 
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1. Introduction. One of the main differences between human brains and 
those of other animals is the size of the neocortex [Frahm, Stephan and 
Stephan (1982); Hofman (1989); Radic (1995); Van Essen (1997)]. Humans 
have one of the largest neocortical sheets, relative to their body weight, in 
the entire animal kingdom. The human neocortex is not a single undifferen- 
tiated functional unit, but consists of several hundred individual processing 
modules called areas. These areas are arranged in a highly interconnected, 
hierarchically organized network. The visual system alone consists of sev- 
eral dozen different visual areas, each of which plays a distinct functional 
role in vision. The largest visual area (indeed, the largest area in the entire 
neocortex) is the primary visual cortex, area VI. Because of its central im- 
portance in vision, area VI has long been a primary target for computational 
modeling. 

The most powerful tool available for measuring human brain activity is 
functional MRI (fMRI). However, fMRI data provide a rather complicated 
window on neural function. First, fMRI does not measure neuronal activ- 
ity directly, but rather measures changes in blood oxygenation caused by 
metabolic processes in neurons. Thus, fMRI provides an indirect and non- 
linear measure of neuronal activity. Second, fMRI has a fairly low temporal 
and spatial resolution. The temporal resolution is determined by physical 
changes in blood oxygenation, which are two orders of magnitude slower 
than changes in neural activity. The spatial resolution is determined by the 
physical constraints of the fMRI scanner (i.e, limits on the strength of the 
magnetic fields that can be produced, and limits on the power of the radio 
frequency energy that can be deposited safely in the tissue). In practice, 
fMRI signals usually have a temporal resolution of 1-2 seconds, and a spa- 
tial resolution of 2-4 millimeters. Thus, a typical fMRI experiment might 
produce data from 30,000-60,000 individual voxels (i.e., volumetric pixels) 
every 1-2 seconds. These data must first be filtered to remove nonstationary 
noise due to subject movement and random changes in blood pressure. Then 
they can be modeled and analyzed in order to address specific hypotheses 
of interest. 

One recent approach for modeling fMRI data is to use a training data 
set to estimate a separate model for each recorded voxel, and to test pre- 
dictions on a separate validation data set. In computational neuroscience 
these models are called encoding models, because they describe how infor- 
mation about the sensory stimulus is encoded in measured brain activity. 
Alternative hypotheses about visual function can be tested by comparing 
prediction accuracy of multiple encoding models that embody each hypoth- 
esis [Naselaris et al. (2011)]. Furthermore, estimated encoding models can 
be converted directly into decoding models, which can in turn be used to 
classify, identify or reconstruct the visual stimulus from brain activity mea- 
surements alone [Naselaris et al. (2011)]. These decoding models can be used 
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to measure how much information about specific stimulus features can be ex- 
tracted from brain activity measurements, and to relate these measurement 
directly to behavior [Raizada et al. (2010); Walther et al. (2009); Williams, 
Dang and Kanwisher (2007)]. 

Most encoding and decoding models rely on parametric regression meth- 
ods that assume the response is linearly related with stimulus features after 
fixed parametric nonlinear transformation(s). These transformations may 
be necessitated by nonlinearities in neural processes [e.g., Carandini, Heeger 
and Movshon (1997)], and other potential sources inherent to fMRI such as 
dynamics of blood flow and oxygenation in the brain [Buxton, Wong and 
Frank (1998); Buxton et al. (2004)] and other biological factors [Lauritzen 
(2005)]. However, it can be difficult to guess the most appropriate form of 
the transformation(s), especially when there are thousands of voxels and 
thousands of features, and when there may be different transformations for 
different features and different voxels. Inappropriate transformations will 
most likely adversely affect prediction accuracy and might also result in 
incorrect inferences and interpretations of the fitted models. 

In this paper we use a new, sparse and flexible nonparametric approach to 
more adequately model the nonlinearity in encoding models for fMRI voxels 
in human area VI. The data were collected in an earlier study [Kay et al. 
(2008a)]. The stimuli were grayscale natural images (see Figure 1). The orig- 
inal analysis focused on a class of models that included a fixed parametric 
nonlinear transformation of the stimuli, followed by linear weighting. Here we 
show by residual analysis that this model does not account for a substantial 
nonlinear response component (Section 4). We therefore model these data 
by a sparse nonparametric method [Ravikumar et al. (2009b)] after preselec- 
tion of features by marginal correlation. The resulting model qualitatively 
affects inferred tuning properties of VI voxels (Section 6), and it substan- 
tially improves response prediction (Section 4.2). The sparse nonparametric 
model also improves decoding accuracy (Section 5). We conclude that the 
nonlinearities found in the responses of voxels measured using fMRI impact 
both model performance and model interpretation. Although our paper fo- 
cuses entirely on area VI, our approach can be extended easily to voxels 
recorded in other areas of the brain. 

2. Background on VI. Brain area VI is located in the occipital cor- 
tex and is an early processing area of the visual pathway. It receives much 
of its input from the lateral geniculate nucleus — a small cluster of cells in 
the thalamus that is the brain's primary relay center for visual information 
from the eye. Many of the properties of VI neurons have been described 
by visual neuroscientists [see De Valois and De Valois (1990) for a sum- 
mary]. In most cases these neurons are described as spatio-temporal filters 
that respond whenever the stimulus matches the tuning properties of the 
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Fig. 1. Examples of natural image stimuli. The natural images used in the experiment 
were sampled from a large database of images obtained from a commercial digital library 
(Corel Stock Photo Libraries from Corel Corporation) . The images covered 20 x 20 degrees 
of the field of view, and were cropped to a circular aperture and blended into the background 
to reduce edge effects. 

filter. The important spatial tuning properties for VI neurons are related 
to spatial position, orientation and spatial frequency. Thus, each VI neuron 
responds maximally to stimuli that appear at a particular spatial location 
within the visual field, with a particular orientation and spatial frequency. 



ENCODING AND DECODING VI FMRI 



5 



Stimuli at different spatial positions, orientations and frequencies will elicit 
lower responses from the neuron. Because VI neurons are tuned for spatial 
position, orientation and spatial frequency they are often modeled as Gabor 
filters (whose impulse response is the product of a harmonic function and a 
Gaussian kernel) [De Valois and De Valois (1990)]. 

Although tuning for orientation and spatial frequency can be described 
using a linear filter model, it is well established that individual VI neurons do 
not behave exactly like linear filters. Studies using white noise stimuli have 
reported a nonlinear relationship between linear filter outputs and measured 
neural responses [e.g., Sharpee, Miller and Stryker (2008); Touryan, Lau and 
Dan (2002)]. Furthermore, it is known that the responses of VI neurons sat- 
urate (like yfx or logx) with increasing contrast [e.g., Albrecht and Hamilton 
(1982); Sclar, Maunsell and Lennie (1990)]. Finally, there is evidence that 
the responses of VI neurons are normalized by the activity of other neu- 
rons in their spatial or functional neighborhood. This phenomenon — known 
as divisive normalization — can account for a variety of nonlinear behaviors 
exhibited by VI neurons [Carandini, Heeger and Movshon (1997); Heeger 
(1992)]. It is reasonable to expect that the nonlinearities at the neural level 
will affect voxel responses evoked by natural images, so a statistical model 
should describe adequately these nonlinearities. 

3. The fMRI data. The data consist of fMRI measurements of blood 
oxygen level-dependent activity (or BOLD response) at m = 1,331 voxels 
in area VI of a single human subject [see Kay et al. (2008a)]. The voxels, 
measuring 2 x 2 x 2.5 millimeters, were acquired in coronal slices using a 4T 
INOVA MR (Varian, Inc., Palo Alto, CA) scanner, at a rate of 1Hz, over 
multiple sessions. Two sets of data were collected during the experiment: 
training and validation. During the training stage the subject viewed n = 
1,750 grayscale natural images randomly selected from an image database, 
each presented twice (but not consecutively) in a pseudorandom sequence; 
see Figure 1. Each image was presented in an ON-OFF-ON-OFF-ON pattern 
for 1 second with an additional 3 seconds OFF between presentations. For 
the validation data the subject viewed 120 novel natural images presented in 
the same way as in the training stage, but with a total of 13 presentations of 
each image. Data collection required approximately 10 hours in the scanner, 
distributed across 5 two hour sessions. 

Data preprocessing is necessary to correct several sampling artifacts 
that are intrinsic to fMRI. First, volumes were manually co-registered 
(in-house software) to correct for differences in head positioning across 
sessions. Slice-timing and automated motion corrections (SPM99, 
http://www.fil.ion.ucl.ac.uk/spm) were applied to volumes acquired 
within the same session. These corrections are standard and their details 
are explained in the supplementary information of Kay et al. (2008a). 
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Our encoding and decoding analyses depend upon denning a single scalar 
fMRI voxel response to each image. The procedures used to extract this 
scalar response from the BOLD time series measurements acquired during 
the fMRI experiment are described in the Appendix. In short, we assume 
that each distinct image evokes a fixed timecourse response, and that the 
response timecourses evoked by different images differ by only a scale factor. 
We use a model in which the response timecourses and scale factors are 
treated as separable parameters, and then use these scale factors as the 
scalar voxel responses to each image. By extracting a single scalar response 
from the entire timecourse, we effectively separate the salient image-evoked 
attributes of the BOLD measurements from those attributes due to the 
BOLD effect itself [Kay et al. (2008b)]. 

4. Encoding the VI voxel response. An encoding model that predicts 
brain activity in response to stimuli is important for neuroscientists who 
can use the model predictions to investigate and test hypotheses about the 
transformation from stimulus to response. In the context of fMRI, the voxel 
response is a proxy for brain activity, and so an fMRI encoding model pre- 
dicts voxel responses. Let Y v be the response of voxel v to an image stimulus 
S. We follow the approach of Kay et al. (2008a) and model the conditional 
mean response, 



as a function of local contrast energy features derived from projecting the 
image onto a 2D Gabor wavelet basis. These features are inspired by the 
known properties of neurons in VI, and are well established in visual neu- 
roscience [see, e.g., Adelson and Bergen (1985); Jones and Palmer (1987); 
Olshausen and Field (1996)]. A 2D Gabor wavelet g is the pointwise product 
of a complex 2D Fourier basis function and a Gaussian kernel: 



The basis we use is organized into 6 spatial scales/frequencies (w,<7i,a"2), 
where wavelets tile spatial locations (ao,&o) an d 8 possible orientations 9, for 
a total of p = (l 2 + 2 2 + 4 2 + 8 2 + 16 2 + 32 2 ) x 8 = 10,920 wavelets. Figure 2 
shows all of the possible scale and orientation pairs. 



H v (s):=E(Y v \S = s), 




where 



a = (a — ao) cos6> + (o — ao) sin6>, 
b = (b — bo) cos 8 — (b — bo) sin 9. 
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Fig. 2. Examples of Gabor wavelets. The basis used by the encoding model is orga- 
nized into 6 spatial scales (rows) and 8 orientations (columns). The imaginary part of 
the wavelets is not shown. 

Let gj denote a wavelet in the basis. The local contrast energy feature is 
defined as 



X 3 {s)-, 



a.b 



^Regj(a,b)s(a,b) + ^Imgj(a,b)s(a,b) 



a.b 



for j = 1, . . . ,p = 10,920. The feature set is essentially a localized version of 
the (estimated) Fourier power spectrum of the image. Each feature mea- 
sures the amount of contrast energy in the image at a particular frequency, 
orientation and location. 

4.1. Sparse linear models. The model proposed in Kay et al. (2008a) as- 
sumes that n v (s) is a weighted sum of a fixed transformation of the local 
contrast energy features. They applied a square root transformation to Xj 
to make the relationship between /i„(s) and the transformed features more 
linear. Thus, their model is 



(4.1) 



Xj(s) 



3=1 



We refer to (4.1) as the sqrt(X) model. Kay et al. (2008a) fit this model 
separately for each of the 1,331 voxels, using gradient descent on the squared 
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standardized fitted value 

Fig. 3. Residual and fitted values of model (4-1) for four different voxels (labeled above). 
The solid curves show a LOESS fit of the residual on the fitted values. 

error loss with early stopping [see, e.g., Friedman and Popescu (2004)], and 
demonstrated that the fitted models could be used to identify, from a large 
set of novel images, which specific image had been viewed by the subject. 
They used a simple decoding method that selects, from a set of candidates, 
the image s whose predicted voxel response pattern (ft v (s) :v = 1,2, ...) is 
most correlated with the observed voxel response pattern (Y v :v = 1,2, . . .). 
Although Kay et al. (2008a) focused on decoding, the encoding model is 
clearly an integral part of their approach. We found a substantial nonlinear 
aspect of the voxel response that their encoding sqrt(X) model does not 
take into account. 

Since the gradient descent method with early stopping is closely related 
to the Lasso method [Friedman and Popescu (2004)], we fit the model (4.1) 
separately to each voxel [as in Kay et al. (2008a)] using Lasso [Tibshirani 
(1996)], and selected the regularization parameters with BIC (using the 
number of nonzero coefficients in a Lasso model as the degrees of freedom). 
Figure 3 shows plots of the residuals and fitted values for four different 
voxels. With the aid of a LOESS smoother [Cleveland and Devlin (1988)], 
we see a nonlinear relationship between the residual and the fitted values. 
This pattern is not unique to these four voxels. We extended this analysis 
to all 1,331 voxels. By standardizing the fitted values, we can overlay the 
smoothers for all 1,331 voxels and inspect for systematic deviations from 
the sqrt(X) model across all voxels. Figure 4 shows the result. Nonlinearity 
beyond the sqrt(X) model is present in almost all voxels, and, moreover, 
the residuals appear to be heteroskedastic. 

Composing the square root transformation with an additional nonlinear 
transformation could absorb some of the residual nonlinearity in the sqrt(X) 
model. Instead of the square root, log(l + y/x) was used by Naselaris et al. 
(2009) to analyze the same data set as we do in this paper and it has also 
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Fig. 4. Residual and standardized fitted values of model (4-1) blended across all 1,331 
voxels. The solid curves show the LOESS fits of the residuals on the fitted values for each 
voxel. 

been used in other applications [see Kafadar and Wegman (2006) for an 
example in the analysis of internet traffic data] . The resulting model is 

p , 

(4.2) fi v (s) = p v0 + Atf Ml + \f x j(s)), 

and we refer to it as the log(l + sqrt(X)) model. 

We fit model (4.2) using Lasso with BIC, and compared its prediction 
performance with model (4.1) by evaluating the squared correlation (pre- 
dictive R 2 ) between the predicted and actual response across all 120 images 
in the validation set. Figure 5 shows the difference in predictive R 2 values 
of the two models for each voxel. There is an improvement in prediction 
performance (median 5.5% for voxels where both models have an R 2 > 0.1) 
with model (4.2). However, examination of residual plots (not shown) reveals 
that there is still residual nonlinearity. 

4.2. Sparse additive (non-parametric) models. The ^fx and log(l + ^fx) 
transformations were used in previous work to approximate the contrast 
saturation of the BOLD response. Rather than trying other fixed transfor- 
mations to account for the nonlinearities in the voxel response, we employed 
a sparse nonparametric approach that is based on the additive model. The 
additive model [cf. Hastie and Tibshirani (1990)] is a useful generalization 
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Fig. 5. Comparison of voxel-wise predictive R 2 (based on the validation data) of the 
log(l + sqrt(X)) model (4-2) and the sqrt(X) model (4-1)- The vertical axis shows the 
difference R 2 of (4-2) —R 2 of (4-1)- The median improvement of model (4-2) is 5.5% for 
voxels where both models have a predictive R 2 > 0.1. 



of the linear model that allows the feature transformations to be estimated 
from the data. Rather than assuming that the conditional mean [i is a linear 
function (of fixed transformations) of the features, the additive ( nonpar a- 
metric) model assumes that 



v 



(4.3) l* = Po + ±^M X i)> 

j"=i 

where /,• G T~Lj are unknown, mean predictor functions in some Hilbert 
spaces T~Lj. The linear model is a special case where the predictor functions 
are assumed to be of the form fj(x) = (3jX. The monograph of Hastie and 
Tibshirani describes methods of estimation and algorithms for fitting (4.3), 
however, the setting there is more classical in that the methods are most 
appropriate for low-dimensional problems (small p, large n). 

Ravikumar et al. (2009b) extended the additive model methodology to 
the high-dimensional setting by incorporating ideas from the Lasso. Their 
sparse additive model (SPAM) adds a sparsity assumption to (4.3) by as- 
suming that the set of active predictors {j : fj ^ 0} is sparse. They propose 
fitting (4.3) under this sparsity assumption by minimization of the penalized 
squared error loss 

2 

(4.4) min 



v-A)i-x;/i(Xi) 



v 



where || • || is the Euclidean norm in W 1 , Y is the n- vector of sample re- 
sponses, 1 is the vector of l's, /j(Xj) is the vector obtained by applying 
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fj to each sample of Xj, and A > 0. The penalty term, A^j=i ||/j(Xj)||, is 
the functional equivalent of the Lasso penalty. It simultaneously encourages 
sparsity (setting many fj to zero) and shrinkage of the estimated predictor 
functions by acting as an LI penalty on the empirical L2 function norms 
||/j(Xj)||, j = 1, . . . ,p. The algorithm proposed by Ravikumar et al. (2009b) 
for solving the sample version of the SPAM optimization problem (4.4) is 
shown in Figure 6. It generalizes the well-known backfitting algorithm [Fried- 
man and Stuetzle (1981)] by incorporating an additional soft-thresholding 
step. The main bottleneck of the algorithm is the complexity of the smooth- 
ing step. 

We did not apply SPAM directly to the feature Xj(s), but instead applied 
it to the transformed feature, log(l + y/Xj(s)). We refer to the model 

(4.5) fi v (s) = fa +^/^(log(l + ^Xjis))) 

as V-SPAM — "V" for visual cortex and VI neuron-inspired features. There 
is no loss in generality of this model when compared with (4.3), but there is 
a practical benefit because the log(l + yOfj(s)) feature tends to be bet- 
ter spread out than the Xj(s) feature. This has a direct effect on the 
smoothness of f V j. Although we did not try other transformations, we found 
that applying the SPAM model directly to the Xj(s) features rather than 
log(l + y/Xj(s)) resulted in poorer fitting models. 

We fit the V-SPAM model separately to each voxel, using cubic spline 
smoothers for the f V j. We placed knots at the deciles of the log(l + \/X~j) 
feature distributions and fixed the effective degrees of freedom [trace of the 
corresponding smoothing matrix; cf. Hastie and Tibshirani (1990)] to 4 for 
each smoother. This choice was based on examination of a few partial resid- 
ual plots from model (4.2) and comparison of smooths for different effec- 
tive degrees of freedoms. We felt that optimizing the smoothing parameters 



Input: Sample vectors (Y, Xi,...,X p ), smoothers (smoothi, . . . , smooth^), and regular- 
ization parameter (A > 0) 

fj <-0 for j = l,...,p 
repeat 

for j = 1 to p do 

R 3 Y — Pol — X^fc^j fk (Xk) — compute the partial residual 
Sj smooth., -(Rj) 

fj <— Sj(l — A/||sj||)+ — soft-threshold 
end for 

until RSS = ||Y - /3 1 - Ej /j( x j)I| 2 converges 

return estimated intercept /3o and predictor functions f% , f%, . . . , f p 



Fig. 6. The SPAM backfitting algorithm. 
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Fig. 7. Residual and fitted values of V-SPAM (4-5) for four different voxels (labeled 
above). The solid curves show a LOESS fit of the residual on the fitted values. Compare 
with Figure 3. The linear trend in the residuals is due to the shrinkage effect of the penalty 
in the SPAM criterion (4-4)- 

across features and voxels (with generalized cross-validation or some other 
criterion) would add too much complexity and computational burden to the 
fitting procedure. 

The amount of time required to fit the V-SPAM model for a single voxel 
with 10,920 features is considerably longer than for fitting a linear model, 
because of the complexity of the smoothing step. So for computational rea- 
sons we reduced the number of features to 500 by screening out those that 
have low marginal correlation with the response, which reduced the time to 
fit one voxel to about 10 seconds. 8 We selected the regularization parameter 
A using BIC with the degrees of freedom of a candidate model defined to be 
the sum of the effective degrees of freedom of the active smoothers (those 
corresponding to nonzero estimates of fj). 

Figure 7 shows residual and fitted value plots for the four voxels that 
we examined in the previous section. Little residual nonlinearity remains 
in this aspect of the V-SPAM fit. The residual linear trend in the LOESS 
curve is due to the shrinkage effect of the SPAM penalty — the residuals of a 
penalized least squares fit are necessarily correlated with the fitted values. 
Figure 8 shows the residuals and fitted values of V-SPAM for all 1,331 voxels. 
In contrast to Figure 4, there is neither a visible pattern of nonlinearity, nor 
a visible pattern of heteroskedasticity. 

The V-SPAM model better addresses nonlinearities in the voxel response. 
To determine if this model leads to improved prediction performance, we ex- 
amined the squared correlation (predictive R 2 ) between the predicted and 



8 Timing for an 8-core, 2.8 GHz Intel Xeon-based computer using a multithreaded linear 
algebra library with software written in R. 
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Fig. 8. Residual and standardized fitted values of V-SPAM (4-5) for all 1,331 voxels. 
The solid curves show the LOESS fits of the residuals on the fitted values for each voxel. 
Compare with Figure 4- 

actual response across all 120 images in the validation set. Figure 9 com- 
pares the predictive R 2 of the V-SPAM model for each voxel with those 
of the sqrt(X) model (4.1) and the log(l + sqrt(X)) model (4.2). Across 
most voxels, there is a substantial improvement in prediction performance. 
The median (across voxels where both models have a predictive R 2 > 0.1) is 
26.4% over the sqrt(X) model, and 19.9% over the log(l + sqrt(X)) model. 
Thus, the additional nonlinear aspects of the response revealed in the resid- 
ual plots (Figures 3 and 4) for the parametric sqrt(X) and log{\ + sqrt(X)) 
models are real and they account for a substantial part of the prediction of 
the voxel response. 

5. Decoding the VI voxel response. Decoding models have received a 
great deal of attention recently because of their role in potential "mind 
reading" devices. Decoding models are also useful from a statistical point of 
view because their results can be judged directly in the known and controlled 
stimulus space. Here we show that accurately characterizing nonlinearities 
with the V-SPAM encoding model (presented in the preceding section) leads 
to substantially improved decoding. 

We used a Naive Bayes approach similar to that proposed by Naselaris 
et al. (2009) to derive a decoding model from the V-SPAM encoding model. 
Recall that Y v (v = 1, . . . , m and m = 1,331) is the response of voxel v to 
image S. A simple model for Y v that is compatible with the least squares 
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0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 

Predictive R 2 of sqrt(X) model Predictive R 2 of log(1 +sqrt(X)) model 

(b) 

Fig. 9. Comparison of voxel-wise predictive R 2 (based on the validation data) of the 
sqrt(X) model (4.1), the log(l + sqrt(X)) model (4.2) and V-SPAM (4.5). (a) Histograms 
of the predictive R 2 value across voxels. They are displayed sideways to ease comparison. 
(b) Difference of predictive R 2 values of V-SPAM (4-5): (left) sqrt(X) model (4-1)', (fight) 
log{\ + sqrt(X)) model (4.2). 

fitting in Section 4 assumes that the conditional distribution of Y v given 
S is Normal with mean ij, v (S) and variance cr^, and that Y\,...,Y m are 
conditionally independent given S. To complete the specification of the joint 
distribution of the stimulus and response, we take an empirical approach 
[Naselaris et al. (2009)] by considering a large collection of images B similar 
to those used to acquire training and validation data. The bag of images 
prior places equal probability on each image in B: 



P(5 = s) = 



if s e B, 
otherwise. 
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This distribution only implicitly specifies the statistical structure of natural 
images. With Bayes' rule we arrive at the decoding model 

/ I \ / (yv-(*v(s)) 2 \ 
p{s\yi, . . . ,y s ) (x expi- ^2 J xF ( s = s )- 

This model suggests that we can identify the image s that most closely 
matches a given voxel response pattern (Y~i, . . . , Y rn ) by the rule 

m ^ 

(5.1) argmaxp(s|yi, ...,y s ) = arg min V] (y„ - /^(s)) 2 . 

The fitted models from Section 4 provide estimates of \i v . Given fi v , the 
variance a 2 can be estimated by 



A,(S)|| 2 



n - di(fi v ) 

where di(fi v ) is the degrees of freedom of the estimate £i v (the number of 
nonzero coefficients in the case of linear models, or 4 times the number of 
nonzero functions in the case of V-SPAM; cf. Section 4.2). Substituting these 
estimates into (5.1) gives the decoding rule 

m . 

argmin V" (y v - £„(s)) 2 - 
sets ^ x a% 

Although we have estimates for every voxel, not every voxel may be useful 
for decoding — fi v may be a poor estimate of \x v or fi v (s) may be close to 
constant for every s. In that case, we may want to select a subset of voxels 
V C {1, . . . , to} and restrict the summation in the above display to V. Thus, 
we propose the decoding rule 

(5.2) Sy(yi,..., y m \B) = argmin V" xoO/u - frv(s)) 2 . 

One strategy for voxel selection is to set a threshold a for entry to V based 
on the usual R 2 computed with the training data, 

l \Y v - fi v (S)\\ 2 



(5.3) training R (v) = 1 



IV — V 

-■-7? 1 7 



so that V a = {v : training R 2 (v) > a}. We will examine this strategy later in 
the section. 

To use (5.2) as a general purpose decoder, the collection of images B 
should ideally be large enough so that every natural image S is "well- 
approximated" by some image in B. This requires a distance function over 
natural images in order to formalize "well-approximate," but it is not clear 
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what the distance function should be. We consider instead the following 
paradigm. Suppose that the image stimulus S that evoked the voxel re- 
sponse pattern is actually contained in B. Then it may be possible for (5.2) 
to recover S exactly. This is the basic premise of the identification problem 
where we ask if the decoding rule can correctly identify S from a set of can- 
didates BU{S}. Within this paradigm, we assess (5.2) by its identification 
error rate, 

(5.4) id error rate := F(§v(Y{, . . . , Y^\B U {S'}) ^ S'\S V (- ■ ■)), 

on a future stimulus and voxel response pair {S' , (Y{, . . . , Y^J} that is inde- 
pendent of the training data. 

The identification error rate should increase as \B\ = b increases. However, 
the rate at which it increases will depend on the model used for estimating 
p, v . We investigated this by starting with a database V of 11,499 images 
(as in Figure 1) that are similar to, but do not include, the images in the 
training data or validation data, and then repeating the following experiment 
for different choices of b: 

(1) Form B by drawing a sample of size b without replacement from D. 

(2) Estimate the identification error rate (5.4) using the 120 stimulus and 
voxel response pairs {S' , (Y{, . . . , Y^)} in the validation data. 

(3) Average the estimated identification error rate over all possible B^T> 
of size b. 

The average identification error rate can be computed without resorting to 
Monte Carlo. Given {S' , (Y{, . . . , Y^)}, 

(5.5) S V (Y(,...,Y^\BU{S'}) = S' 
if and only if 

(5-6) £ ±04 - MS)) 2 < £ ±M - A,( S )) 2 

for every s E B. Since B is drawn by a simple random sample, the number 
of times that event (5.6) occurs follows a hyper geometric distribution. So 
the conditional probability that (5.5) occurs is just the probability that 
a hypergeometric random variable is equal to b. The parameters of this 
hypergeometric distribution are given by the number of images in T> that 
satisfy (5.6), the number of images in V that do not satisfy (5.6), and b. 
Counting the number of images in V that satisfy/do not satisfy (5.6) is easy 
and only has to be done once for each S in the validation data, regardless of 
b. Thus, the computation involves evaluating (5.6) 120 x 11,499 times (since 
there are 120 images in the validation data and 11,499 images in D), and 
then evaluating 120 hypergeometric probabilities for each b. 
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Fig. 10. Estimated average identification error rate (5.4) as a function of the number of 
possible images (\B\ + 1)- The error rates were estimated using the validation data and B 
randomly sampled from a database of 11,499 images. 



Figure 10 shows the results of applying the preceding analysis to the fixed 
transformation models (4.1) and (4.2) and the V-SPAM model (4.5). Each 
model has its own subset of voxels V used by the decoding rule. We set the 
training R 2 thresholds (5.3) so that the corresponding decoding rule used 
|V| = 400 voxels for each model. When \B\ is small, identification is easy 
and all three models have very low error rates. As the number of possible 
images increases, the error rates of all three models increase but at different 
rates. At maximum, when B = T> and there are 11,499 + 1 = 11,500 candidate 
images (11,499 images in D plus 1 correct image not in T>) for the decoding 
rule to choose from, the fixed transformation models have an error rate of 
about 40%, while the V-SPAM model has an error rate of about 28%. 

The ordering of and large gap between the fixed transformation models 
and V-SPAM at maximum does not depend on our choice of | V| = 400 vox- 
els. Fixing B = T> so that the number of possible images is maximal, we 
examined how the identification error rate varies as the training R 2 thresh- 
old is varied. Figure 11 shows our results. The threshold corresponding to 
400 voxels is larger for V-SPAM than the fixed transformation models. It is 
about 0.1 for V-SPAM and 0.05 for the fixed transformation models. When 
the threshold is below 0.05, the error rates of the three models are indistin- 
guishable. Above 0.05, V-SPAM generally has a much lower error rate than 
the fixed transformation models. In panel (a) of Figure 11 we also see that 
V-SPAM can achieve an error rate lower than the best of the fixed trans- 
formation models with half as many voxels (< 200 versus > 400). These 
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Fig. 11. Identification error rate (5.4) as a function of the training R threshold (5.3) 
when the number of possible images is 11,499 + 1. (a) Estimated identification error rate. 
The solid circles on each curve mark the points where the number of voxels used by the 
decoding rule is (from left to right) 400, 200 or 100. (b) Pointwise 95% confidence bands 
for the difference between the identification error rates of (upper) sqrt(X) model (4-1) and 
V-SPAM; (lower) log(l + sqrt(X)) model (4.2) and V-SPAM. The confidence bands reflect 
uncertainty due to sampling variation of the validation data. 



results show that the substantial improvements in voxel response prediction 
by V-SPAM can lead to substantial improvements in decoding accuracy. 
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6. Nonlinearity and inferred tuning properties. In computational neu- 
roscience, the tuning function describes how the output of a neuron or voxel 
varies as a function of some specific stimulus feature [Zhang and Sejnowski 
(1999)]. As such, the tuning function is a special case of an encoding model, 
and once an encoding model has been estimated, a tuning function can be 
extracted from the model by integrating out all of the stimulus features ex- 
cept for those of interest. In practice, this extraction is achieved by using 
an encoding model to predict responses to parametrized, synthetic stimuli. 
One way to assess the quality of an encoding model is to inspect the tuning 
functions that are derived from it [Kay et al. (2008a)]. 

For vision, the most fundamental and important kind of tuning function 
is the spatial receptive field. Each neuron (or voxel) in each visual area is 
sensitive to stimulus energy presented in a limited region of visual space, 
and spatial receptive fields describe how the response of the neuron or voxel 
is modulated over this region. In the primary visual cortex, response mod- 
ulation is typically strongest at the center of the receptive field. Response 
modulation is much weaker at the periphery, but has been shown to have 
functionally significant effects on the output of the neuron (or voxel) [Vinje 
and Gallant (2000)]. 

The panels in Figure 12 show estimated spatial receptive fields for voxel 
717 using the three different models considered here [we chose this voxel 
because its predictive R 2 varied greatly among the three models: 0.26 for 
the sqrt(X) model (4.1), 0.42 for the log{\ + sqrt(X)) (4.2), and 0.57 for V- 
SPAM (4.5)]. These estimated receptive fields indicate the locations within 
the spatial field of view that are predicted to modulate the response of the 
voxel by each model. All three models agree that the voxel is tuned to a 
region in the lower-right quadrant of the field of view; however, for V-SPAM 
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Fig. 12. Estimated spatial receptive field for voxel 717. The contours show the predicted 
response to a point stimulus placed at various locations across the field of view. They 
indicate the sensitivity of the voxel to different spatial locations. 
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Fig. 13. Estimated frequency and orientation tuning for voxel 717. The contours show 
the predicted response to a 2D cosine stimulus (a 2D Fourier basis function) parameterized 
by frequency and orientation. Darker regions correspond to greater predicted responses. The 
plot reveals sensitivity of the voxel to different spectral components. 



the receptive field is more expansive, and is thus able to capture the weak 
but potentially important responses at the far periphery of the visual field. 

Like spatial tuning, orientation and frequency tuning are fundamental 
properties of VI, so it is essential to inspect the orientation and frequency 
tuning functions that are derived from encoding models for this area. As 
seen in the panels of Figure 13, the V-SPAM model is better able to capture 
the weaker responses to orientations and spatial frequencies away from the 
peaks of the tuning. 

Finally, we examine tuning to image contrast, which is another critical 
property of VI. Image contrast strongly modulates responses in VI and is 
also perceptually salient, so contrast tuning functions are frequently used 
to study the relationship between activity and perception [Olman et al. 
(2004)]. The contrast tuning function describes how a voxel is predicted 
to respond to different contrast levels. It is constructed by computing the 
predicted response to a stimulus of the form t ■ w, where w is standardized 
2D pink noise (whose power spectral density is of the form 1/|cj|), and 
t > is the root-mean-square (RMS) contrast. At zero contrast the noise is 
invisible and only the background can be seen; as contrast increases the noise 
becomes more visible and distinguishable from the background. Figure 14 
shows the contrast response function for the voxel as estimated by the three 
models. The first two, the sqrt(X) and log{\ + sqrt(X)), look nearly linear 
and relatively flat over the range of contrasts present in the training images. 
The V-SPAM prediction tapers off as contrast increases, and it is much more 
negative for low contrasts than predicted by sqrt(X) and log{\ + sqrt(X)). 
The V-SPAM prediction is closer to what is expected based on previous 
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Fig. 14. Estimated contrast tuning function for voxel 717. This is the predicted response 
to a pink noise stimulus at different levels of RMS contrast t. The tick marks indicate the 
deciles of RMS contrast in the training images (e.g., fewer than 10% of training images 
have contrast between 2 and 4). 



direct measurements [Olman et al. (2004)], and suggests that V-SPAM is 
more sensitive to responses evoked by lower contrast stimulus energy. 

The relatively more sensitive tuning functions derived from the V-SPAM 
model of voxel 717 have a simple explanation. The models selected by BIC for 
this voxel included different numbers of features: 7 for sqrt(X), 29 for log(l + 
sqrt(X)), and 53 for V-SPAM. Since the features are localized in space, 
frequency, and orientation, the number of features in the selected model is 
related to the sensitivity of the estimated tuning functions in the periphery. 
BIC forces a trade-off between the residual sum of squares (RSS) and number 
of features. The models with fixed transformations have much larger RSS 
values than V-SPAM, and the trade-off (see Figure 15) favors fewer features 
for them because the residual nonlinearity (as shown in Figure 3) does not go 
away with increased numbers of features. This suggests that the sensitivity 
of a voxel to weaker stimulus energy is not detected by the sqrt(X) and 
log(l + sqrt(X)) models, because it is masked by residual nonlinearity. So 
the tuning function of a voxel can be much broader than inferred by the 
model when the model is incorrect. 



7. Conclusion. Using residual analysis and a start-of-the-art sparse ad- 
ditive nonparametric method (SPAM), we have derived V-SPAM encoding 
models for VI fMRI BOLD responses to natural images and demonstrated 
the presence of an important nonlinearity in VI fMRI response that has not 
been accounted for by previous models based on fixed parametric nonlinear 
transforms. This nonlinearity could be caused by several different mecha- 
nisms including the dynamics of blood flow and oxygenation in the brain 
and the underlying neural processes. By comparing V-SPAM models with 
the previous models, we showed that V-SPAM models can both improve 
substantially prediction accuracy for encoding and decrease substantially 
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Fig. 15. Comparison of BIC paths for different models of voxel 717: the sqrt(X) model 
(4.1), the log(l + sqrt(X)) model (4.2), and V-SPAM (4.5). 



identification error when decoding from very large collections of images. We 
also showed that the deficiency of the previous encoding models with fixed 
parametric nonlinear transformations also affects tuning functions derived 
from the fitted models. 

Since encoding and decoding models are becoming more prevalent in fMRI 
studies, it is important to have methods to adequately characterize the non- 
linear aspects of the response-stimulus relationship. Failure to address non- 
linearity effectively can lead to suboptimal predictions and incorrect infer- 
ences. The methods used here, combining residual analysis and sparse non- 
parametric modeling, can easily be adopted by neuroscientists studying any 
part of the brain with encoding and decoding models. 

APPENDIX: EXTRACTING THE FMRI BOLD RESPONSE 

The fMRI signal Z v (t) measured at voxel v can be modeled as a sum of 
three components: the BOLD signal B v (t), a nuisance signal N v (t) (consist- 
ing of low frequency fluctuations due to scanner drift, physiological noise, 
and other nuisances), and noise £ v (t): 

Z v (t) = B v (t) + N v (t) + e v (t). 

The BOLD signal is a mixture of evoked responses to image stimuli. This 
reflects the underlying hemodynamic response that results from neuronal 
and vascular changes triggered by an image presentation. The hemodynamic 
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Fig. 16. A model hemodynamic response function. 



response function h v (t) characterizes the shape of the BOLD response (see 
Figure 16), and is related to the BOLD signal by the linear time invariant 
system model [Friston, Jezzard and Turner (1994)], 

n 

B v{t) = J2Yl A v{k)K{t-r), 

k=l T£T k 

where n is the number of images, is the set of times at which image k is 
presented to the subject, and A v (k) is the amplitude of the voxel's response 
to image k. 

To extract A v {-) from the fMRI signal, it is necessary to estimate the 
hemodynamic response function and the nuisance signal. We used the method 
described in Kay et al. (2008b), modeling h v (t) as a linear combination of 
Fourier basis functions covering a period of 16 seconds following stimulus 
onset, N v (t) as a degree 3 polynomial, and e v (t) as a first-order autoregres- 
sive process. The resulting estimates A v {-) are the voxel responses for each 
image. 
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